ML Course - Introduction to Machine Learning

Author

Alexandre Bry

1 Definitions

1.0.1 Machine Learning (ML)

Machine Learning (ML)

Feeding data into a computer algorithm in order to learn patterns and make predictions in new and different situations.

ML Model

Computer object implementing a ML algorithm, trained on a set of data to perform a given task.

ML is really about learning to make extrapolations on a given task.

1.1 Categories of ML

1.1.1 Type of dataset

The four main categories of ML are based on the type of dataset used to train the model:

Supervised Unsupervised Semi-supervised Reinforcement
Input Data Data Data Environment
Ground-truth Yes No Partial No (reward)
Examples Classification, Regression Clustering Anomaly detection Game playing
  • Supervised: for each input in the dataset, the expected output is also part of the dataset
  • Unsupervised: for each input in the dataset, the expected output is not part of the dataset
  • Semi-supervised: only a portion of the inputs of the dataset have their expected output in the dataset
  • Reinforcement: there is no predefined dataset, but an environment giving feedback to the model when it takes actions

1.1.2 Type of output

Another way to categorize ML models is based on the type of output they produce:

Category Description Example Outputs Example Use Cases
Classification Assign one (or multiple) label(s) chosen from a given list of classes to each element of the input. “Cat”, “Dog”, “Bird” Spam detection, Image recognition
Regression Assign one (or multiple) value(s) chosen from a continuous set of values. 3.5, 7.2, 15.8 Stock price prediction, Age estimation
Clustering Create categories by grouping together similar inputs. Cluster 1, Cluster 2 Customer segmentation, Image compression
Anomaly Detection Detect outliers in the dataset. Normal, Outlier Fraud detection, Fault detection
Generative Models Generate new data similar to the training data. Image, Text, Audio Image generation, Text completion
Ranking Arrange items in order of relevance or importance. Rank 1, Rank 2, Rank 3 Search engine, Recommendation system
Reinforcement Learning Learn a policy to maximize long-term rewards through interaction with an environment. Policy, Action sequence Game playing, Robotics control
Dimensionality Reduction Reduce the number of features while retaining meaningful information. 2D or 3D projection Visualization, Data compression

1.2 Dataset

1.2.1 Dataset - Definition

Dataset

A collection of data used to train, validate and test ML models.

1.2.2 Dataset - Example

Dataset example

sepal length sepal width petal length petal width species
0 6.7 2.5 5.8 1.8 2
1 7.1 3.0 5.9 2.1 2
2 6.5 2.8 4.6 1.5 1
... ... ... ... ... ...
147 6.5 3.0 5.2 2.0 2
148 6.5 3.0 5.8 2.2 2
149 6.1 2.9 4.7 1.4 1

150 rows × 5 columns

Iris dataset example

1.2.3 Content - Definitions

Instance (or sample)

An instance is one individual entry of the dataset (a row).

Feature (or attribute or variable)

A feature is a piece of information that the model uses to make predictions.

Label (or target or output or class)

A label is a piece of information that the model is trying to predict.

Feature vs. Label

Features and labels are simply different columns in the dataset with different roles.

1.2.4 Content - Example

Instances, features and labels

Feature 1 Feature 2 Feature 3 Feature 4 Label
Instance 0 6.7 2.5 5.8 1.8 2
Instance 1 7.1 3.0 5.9 2.1 2
Instance 2 6.5 2.8 4.6 1.5 1
... ... ... ... ... ...
Instance 147 6.5 3.0 5.2 2.0 2
Instance 148 6.5 3.0 5.8 2.2 2
Instance 149 6.1 2.9 4.7 1.4 1

150 rows × 5 columns

Instance, feature and label in the Iris dataset

1.2.5 Subset

Dataset subsets

A ML dataset is usually subdivided into three disjoint subsets, with distinctive role in the training process:

  • Training set: used during training to train the model,
  • Validation set: used during training to assess the generalization capability of the model, tune hyperparameters and prevent overfitting,
  • Test set: used after training to evaluate the performance of the model on new data it has not encountered before.

Metaphor of studies: exercises, past years exams and real exam

2 Overview of ML methods

2.1 Supervised Learning

2.1.1 Decision Trees - Definition

A tree-like structure used for both classification and regression

Simple decision tree on the Iris dataset

Simple decision tree on the Iris dataset

2.1.2 Decision Trees - Example

Decision boundaries of a decision tree on the Iris dataset

Decision boundaries of a decision tree on the Iris dataset

2.1.3 Random Forests - Definition

An ensemble method that combines multiple decision trees:

  • Train independently \(B\) trees using:
    • Bagging: each tree is fitted on a random subset of the training set
    • Feature bagging: each split in the decision tree (i.e. each node) is chosen among a subset of the features
  • Take a decision by aggregating individual decisions of each tree

It can be seen as answering a question by asking multiple people and taking the most common answer. To improve the results, we take people with:

  • Different knowledge (bagging)
  • Different tools (feature bagging)

2.1.4 Random Forests - Example

Decision boundaries of a random forest on the Iris dataset

Decision boundaries of a random forest on the Iris dataset

2.1.5 Support Vector Machines (SVM) - Definition

Used for classification and regression, effective in high-dimensional spaces:

  • Separate the feature space using optimal hyperplanes
  • Features are mapped in a higher dimensional space to allow to fit non-linearly in the original feature space

SVM: map features to a higher dimensional space to be able to separate classes using a hyperplane[1]

SVM: map features to a higher dimensional space to be able to separate classes using a hyperplane[1]

The kernel trick is a way to map the features in a higher dimensional space without actually computing the new features.

2.1.6 Support Vector Machines (SVM) - Example

Decision boundaries of a SVM with the RBF kernel on the Iris dataset

Decision boundaries of a SVM with the RBF kernel on the Iris dataset

2.1.7 Other methods

Other methods for supervised learning include:

Method Description
Linear Regression Predicts a continuous value with a linear model
Logistic Regression Predicts a binary value with a linear model
K-Nearest Neighbors (KNN) Non-parametric method for classification and regression
Boosting Ensemble method (like Random Forests) that combines weak learners to form a strong model
Naive Bayes Probabilistic classifier based on Bayes’ theorem

2.2 Unsupervised Learning

2.2.1 K-Means Clustering

A method for partitioning data into \(k\) clusters:

  • \(k\) must be chosen a priori
  • The principle is to start with random centroids and then iteratively:
    1. Classify points using the closest centroid
    2. Move each centroid to the real centroid of its class
  • Classical method with lots of variation

K-Means Clustering convergence process[2]

K-Means Clustering convergence process[2]

2.2.2 Hierarchical Clustering

Builds a hierarchy of clusters using either agglomerative or divisive methods:

  • Build a full hierarchy top-down (divisive) or bottom-up (agglomerative)
  • Create any number of clusters by cutting the tree

Hierarchical Clustering[3]

Hierarchical Clustering[3]

2.2.3 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Clustering based on the density of data points:

  • Divides points in 4 categories: core points (in red below), directly reachable (yellow), reachable and outliers (blue)
  • Only two parameters: radius size (\(\epsilon\)) and number of neighbors to be core (\(min_{pts}\))

DBSCAN Illustration[4]

DBSCAN Illustration[4]

DBSCAN Result[5]

DBSCAN Result[5]

2.2.4 Other methods

Other methods for unsupervised learning include:

Method Description
Gaussian Mixture Models (GMM) Probabilistic clustering assuming data is generated from multiple Gaussian distributions
Autoencoders Neural networks that learn efficient representations of data in an unsupervised manner
Self-Organizing Maps (SOM) Neural network-based method for clustering and visualization

2.3 Dimensionality Reduction

2.3.1 PCA - Definition

PCA (Principal Component Analysis) is a dimensionality reduction technique to project data into lower dimensions:

  • Project data into a space of lower dimension
  • Keep as much variance (so as much information) as possible

PCA Illustration[6]

PCA Illustration[6]

2.3.2 PCA - Example

PCA on the Iris dataset

PCA on the Iris dataset

2.3.3 t-SNE - Definition

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data

t-SNE Result on MNIST dataset[7]

t-SNE Result on MNIST dataset[7]

2.3.4 t-SNE - Example

t-SNE on the Iris dataset

t-SNE on the Iris dataset

3 Neural Networks

3.1 Definitions

3.1.1 Neural Network (NN)

Neural Network (NN)

Subtype of ML model inspired from brains. Composed of several interconnected layers of nodes capable of processing and passing information.

Deep Learning (DL)

Subcategory of Machine Learning. Consists in using large NN models (i.e. with a high number of layers) to solve complex problems.

Common representation of a neural network

Common representation of a neural network

3.1.2 Structure

Structure of a NN

  • Neuron: Base element. Takes multiple inputs, sums them with weights and passes the result as output.
  • Layer: Set of similar neurons taking different inputs and/or having different weights.
  • Structure: Ensemble of layers composing the model, including their types, sizes, succession and parameters.

3.2 Types of Layers

3.2.1 Dense (or Fully Connected) - Diagram

The most basic layer, in which each output is a linear combination of each input (before the activation layer).

Fully connected layer

Fully connected layer

3.2.2 Dense (or Fully Connected) - Math

The output of a dense layer is computed as follows:

\[ \require{colortbl} \underbrace{ \left[ \begin{array}{c} \rowcolor{#ffd077} x_1 & x_2 & x_3 & x_4 \end{array} \right] }_{\text{Input}} \cdot \underbrace{ \left[ \begin{array}{c} \columncolor{#ffd077} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{array} \right] }_{\text{Weights}} = \underbrace{ \begin{bmatrix} \columncolor{#ffd077} y_1 & y_2 & y_3 \end{bmatrix} }_{\text{Output}} \]

with \(y_1 = x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41}\)

3.2.3 Convolutional - Diagram

A layer combining geographically close features, used a lot to process rasters.

Convolutional layer - real input shape

Convolutional layer - real input shape

Convolutional layer - easier to visualise

Convolutional layer - easier to visualise

3.2.4 Convolutional - Math

The output of a convolutional layer is computed as follows:

\[ \require{colortbl} \underbrace{ \left[ \begin{array}{c} \cellcolor{#ffd077}x_{11} & \cellcolor{#ffd077}x_{12} & x_{13} \\ \cellcolor{#ffd077}x_{21} & \cellcolor{#ffd077}x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{array} \right] }_{\text{Input}} \ast \underbrace{ \left[ \begin{array}{c} \cellcolor{#ffd077}w_{11} & \cellcolor{#ffd077}w_{12} \\ \cellcolor{#ffd077}w_{21} & \cellcolor{#ffd077}w_{22} \end{array} \right] }_{\text{Kernel/Weights}} = \underbrace{ \begin{bmatrix} \cellcolor{#ffd077}y_{11} & y_{12} \\ y_{21} & y_{22} \end{bmatrix} }_{\text{Output}} \]

with \(y_{11} = x_{11} w_{11} + x_{12} w_{12} + x_{21} w_{21} + x_{22} w_{22}\)

3.2.5 Recurrent

Type of layers designed to process sequential data such as text, time series data, speech or audio. Works by combining input data and the state of the previous time step.

The two main variants of recurrent layers are:

Long Short-Term Memory (LSTM)[8]

Long Short-Term Memory (LSTM)[8]

Gated Recurrent Unit (GRU)[9]

Gated Recurrent Unit (GRU)[9]

Nowadays, transformer architectures are however preferred to process sequential data.

3.2.6 Pooling

A type of layers used to reduce the number of features by merging multiple features into one. There are multiple kinds of pooling layers, the most simple ones being Maximum Pooling and Average Pooling.

Max Pooling Example[10]

Max Pooling Example[10]

3.2.7 Other Types

Type Description
Residual Skips some layers to improve training
Attention Focuses on specific parts of the input data
Embedding Transforms discrete data into continuous vectors
Dropout Randomly drop out some of the nodes during training to reduce overfitting
Batch Normalization Normalizes the input of each layer across the batch to improve training stability and speed
Layer Normalization Normalizes the input of each layer across the features to improve training stability and speed
Embedding Transforms discrete input data into continuous vectors with lower-dimensional space
Flatten Convert multi-dimensional data into 1D data that can be fed into fully connected layers

3.3 Architecture and Weights

3.3.1 Both are Crucial

A NN is defined by both its architecture and its weights

  • Architecture: The structure of a NN, including the type and number of layers, their size and succession.
  • Weights: The values used to sum the inputs of each node in the network. They are learned during the training process.

3.3.2 Both are Crucial - Without Weights

Architecture without weights is useless

Architecture without weights is useless

3.3.3 Both are Crucial - Without Architecture

Two architectures with the same weights but doing very different things:

A fully connected architecture

A fully connected architecture

A convolutional architecture

A convolutional architecture

3.3.4 Both are Crucial - Conclusion

Both the architecture and the weights are needed

When importing a model, you need to import both the architecture and the weights. The process varies depending on the library used.

This also means that the having access to the architecture is not enough to reproduce the model.

3.4 Training Process

3.4.1 Objective

What do we want?

We want a model that performs well on a given task. To achieve this, we need to:

  1. Select the right architecture for the model (not covered in these slides)
  2. Find the right weights for the model

3.4.2 Right Weights

What are the right weights?

The right weights are the ones that allow the model to perform well on the task. More precisely:

  • Performing well means minimizing a loss function
  • On the task means on the dataset that we have for the task

Loss function

Function that evaluates the performance of a model. In supervised learning, it compares the predictions of the model to the ground-truth.

3.4.3 General Idea

General idea

Since there is a huge number of possible combinations of weights, we need to search for the right ones. This process is called training. It is iterative and consists of the repetition of three steps:

  1. Forward pass: compute the output of the model on the training data
  2. Backward pass: compute the gradient of the loss function with respect to the weights
  3. Optimization: update the weights using the gradient

The process is repeated until the model performs well on a validation set.

3.4.4 Animation - What Happens

Animation of the training process

Animation of the training process

3.4.5 Animation - What We Really Know

Animation of the training process without the known loss function

Animation of the training process without the known loss function

3.4.6 Animation - Sometimes It Works Well

It works well

It works well

It works well

It works well

It works

It works

3.4.7 Animation - Sometimes It Doesn’t Work

It doesn’t work

It doesn’t work

It doesn’t work

It doesn’t work

It doesn’t work

It doesn’t work

4 Project Pipeline

4.1 Overview

  1. Data acquisition
  2. Data preprocessing
  3. Model selection
  4. Model optimization/training
  5. Model evaluation
  6. Go back to 3 if not satisfied
  7. Final model training

4.2 Data Preprocessing

4.2.1 Different Issues

Multiple sources of issues and steps to perform:

  1. Handle different formats
  2. Remove outliers (mostly for raw data)
  3. (Optionally) extract features
  4. Handle missing data
  5. Normalize

4.2.2 Why Normalization?

No prior knowledge

A priori all features have the same importance, so none of them should have an advantage. Therefore, having features with larger values than others would be detrimental.

Usually, all features are individually normalized over the whole dataset, to obtain a distribution with an average of 0 and a standard deviation of 1:

\[ \begin{align*} \hat{X} & = \sum\limits_{j=0}^n X_j \\ \sigma_X & = \sum\limits_{j=0}^n (X_j - \hat{X})^2 \\ \forall k \in [0, \cdots, n ], X_k & = \frac{X_k - \hat{X}}{\sigma_X} \end{align*} \]

4.3 Model Selection

  • Type of model (ML, NN, DL, …)
  • Complexity:
    • Number of features
    • Type of output
    • Size of the layers (for NN)
    • Number of layers (for NN)

4.4 Model Optimization / Training

  • Loss selection: depends on the task, the objectives, the specific issues to solve
  • Training process selection (lots of different tweaks and improvements can be implemented in NN training)
  • Hyperparameter tuning, by repeatedly:
    • Selecting one or multiple configurations of hyperparameters
    • Training the model one or multiple times
    • Determining the best hyperparameters

4.5 Model Evaluation

4.5.1 Criteria - Classification

Most common evaluation criteria for classification tasks:

Name Use case Formula
Accuracy Balanced datasets \(\frac{TP + TN}{TP + TN + FP + FN}\)
Precision False positives are costly \(\frac{TP}{TP + FP}\)
Recall False negatives are costly \(\frac{TP}{TP + FN}\)
F1-Score Unbalanced class distribution \(\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)

4.5.2 Criteria - Regression

Most common evaluation criteria for regression tasks:

Name Use case Formula
Mean Absolute Error (MAE) Robust to outliers \(\frac{1}{n} \sum\limits_{i=1}^n |y_i - \hat{y}_i|\)
Mean Square Error (MSE) Sensitive to large errors \(\frac{1}{n} \sum\limits_{i=1}^n (y_i - \hat{y}_i)^2\)

4.5.3 Cross-validation

Cross-validation

Method to estimate the real performance of the model:

  1. Split the dataset in multiple parts (usually 5)
  2. Train and evaluate the model for different combinations of these parts (usually 5)

Cross-validation[11]

Cross-validation[11]

4.6 Final Model Training

Once the data is preprocessed, the model is selected, the hyperparameters chosen and optimized, the final model can be trained (potentially multiple times to keep the best one).

5 Challenges

5.1 Data

5.1.1 Quality

Quality of the data is obviously crucial to train well-performing models. Quality encompasses multiple aspects:

  • Raw data quality: the input data must possess high enough details for the task to be even achievable. Be careful however as more features imply larger models which are longer and harder to train.
  • Annotations quality: the annotations must be as precise and correct as possible in the context of the task at hand. Every blunder or outlier in a supervised dataset will slow down training and might result in unexpected behaviors of the trained model.

5.1.2 Diversity

Diversity is the most important aspect of a dataset because ML models are great at generalizing but bad at guessing in new scenarios. There are different aspects to diversity to keep in mind:

  • A well-defined task is crucial to identify all the various cases that we want our model to handle. Being as exhaustive as possible when selecting the training instances will accelerate training and improve the model by a lot
  • Balancing the dataset can also improve the training. When training on imbalanced datasets (i.e. when some cases are much more represented than others), the model will focus on the most represented situations as it will be the easiest and quickest way to get better results. There are ways of correcting this phenomenon, but it is always better to avoid it if possible when building the dataset.

5.1.3 Biases and Fairness

Biased

Refers to a model which always makes the same kind of wrong predictions in similar cases.

In practice, a model trained on biased data will most of the time repeat the biased results. This can have major consequences and shouldn’t be underestimated: even a cold-hearted ML algorithm is not objective if it wasn’t trained on objectively chosen and annotated data.

There exist model architectures, training and evaluation methods to try to prevent and detect biases. They can sometimes allow to build unbiased models using biased data. But this adds complexity to the training process and doesn’t always work.

5.2 Underfitting and Overfitting

5.2.1 Definitions

Underfitting

When a model is too simple to properly extract information from a complex task. Can also be explained by key information missing in the input features.

Overfitting

When a model is too complex to properly generalize to new data. Happens often when a NN is trained too long on a dataset that is not diverse enough and learns the noise in the data.

5.2.2 Illustrations

Underfitting and overfitting on a regression task[12]

Underfitting and overfitting on a regression task[12]

Underfitting and overfitting on a classification task[12]

Underfitting and overfitting on a classification task[12]

5.2.3 Solutions

Lever In case of underfitting In case of overfitting
Complexity Increase Reduce
Number of features Increase Reduce
Regularization Reduce Increase
Training time Increase Reduce

General strategies:

  • Cross-validation to identify problems
  • Grid search/random search to tune hyperparameters and balance between underfitting and overfitting
  • Ensemble methods to reduce overfitting by using many smaller models instead of one big

5.3 Interpretable and Explainable

5.3.1 Definitions

Interpretable

Qualifies a ML model which decision-making process is straightforward and transparent, making it directly understandable by humans. This requires to restrict the model complexity.

Explainable

Qualifies a ML model which decision-making process can be partly interpreted afterwards using post hoc interpretation techniques. These techniques are often used on models which are too complex to be interpreted.

6 Python Libraries

6.0.1 Data Manipulation

  • NumPy
    • Fast numerical operations
    • Matrices with any number of dimensions (called arrays)
    • Lots of convenient operators on arrays
  • Pandas
    • Can store any type of data
    • 1D or 2D tables (called DataFrames)
    • Lots of convenient operators on DataFrames

6.0.2 ML

  • SciPy
    • Scientific and technical computing based on NumPy
    • Lots of tools for optimization, integration, interpolation, etc.
  • scikit-learn
    • Simple and efficient tools for data mining and data analysis
    • Built on NumPy, SciPy and matplotlib

6.0.3 Visualization

  • Matplotlib
    • 2D plotting library (3D also possible)
    • Can create plots, histograms, power spectra, bar charts, error charts, scatterplots, etc.
    • Gallery
  • Plotly
    • Similar to Matplotlib but with interactive plots
    • Gallery
  • Seaborn
    • Data visualization library based on Matplotlib and Pandas
    • Very powerful for statistical data visualization
    • Gallery

7 Resources

7.1 References

[1] Vector: Zirguezi Original: Alisneaky. “SVM kernel trick: Map features to a higher dimensional space to be able to separate classes using a hyperplane.” Available at: https://commons.wikimedia.org/w/index.php?curid=47868867.
[2] Chire. “K-means clustering convergence process.” Available at: https://commons.wikimedia.org/w/index.php?curid=59409335.
[3] derivative work: Mhbrugman Stathis Sideris. “Hierarchical clustering.” Available at: https://commons.wikimedia.org/w/index.php?curid=7344806.
[4] Chire. “DBSCAN illustration.” Available at: https://commons.wikimedia.org/w/index.php?curid=17045963.
[5] Chire. “DBSCAN result.” Available at: https://commons.wikimedia.org/w/index.php?curid=17085332.
[6] Nicoguaro. “PCA illustration.” Available at: https://commons.wikimedia.org/w/index.php?curid=46871195.
[7] Kyle McDonald. “T-SNE result on MNIST dataset.” Available at: https://commons.wikimedia.org/w/index.php?curid=115726949.
[8] fdeloche. “Long short-term memory (LSTM).” Available at: https://commons.wikimedia.org/w/index.php?curid=60149410.
[9] fdeloche. “Gated recurrent unit (GRU).” Available at: https://commons.wikimedia.org/w/index.php?curid=60466441.
[10] Daniel Voigt Godoy. “Max pooling example.” Available at: https://github.com/dvgodoy/dl-visuals/.
[11] Gufosowa. “Cross-validation.” Available at: https://commons.wikimedia.org/w/index.php?curid=82298768.
[12] GeeksforGeeks. “Underfitting and overfitting in machine learning.” Available at: https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/.